fix(journal): serialize concurrent appends via lockfile #1002
carlos-alm wants to merge 6 commits into main from
Conversation
appendJournalEntries used fs.appendFileSync with no cross-process coordination, so a watcher session and a manual codegraph build in a second shell could interleave lines and corrupt .codegraph/changes.journal (truncated DELETED prefixes, partial entries, newline-less tails).

Wrap both appendJournalEntries and writeJournalHeader in a withJournalLock helper that:

- acquires .codegraph/changes.journal.lock via fs.openSync(path, 'wx') (atomic exclusive-create, cross-platform)
- stamps the holder PID for stale-lock detection
- retries every 25ms up to a 5s timeout
- steals the lock if the holder PID is dead or the lock file is older than 30s (covers crash-mid-write)
- always releases in finally

Zero new dependencies — uses only node:fs and Atomics.wait on a SharedArrayBuffer for synchronous retry sleep, preserving the existing sync API.

Fixes #996

Impact: 7 functions changed, 12 affected
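The acquire/retry/release cycle described above could be sketched roughly as follows. This is a minimal illustration of the described design, not the PR's actual code: `withJournalLock`, the constants, and `sleepSync` are named after the description, and stale-lock stealing is omitted for brevity.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const LOCK_RETRY_MS = 25;
const LOCK_TIMEOUT_MS = 5000;

// Synchronous sleep via Atomics.wait on a throwaway SharedArrayBuffer,
// as the description outlines (works on Node's main thread).
function sleepSync(ms: number): void {
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

// Minimal sketch: acquire the lockfile exclusively, run fn, always release.
function withJournalLock<T>(lockPath: string, fn: () => T): T {
  const deadline = Date.now() + LOCK_TIMEOUT_MS;
  let fd: number;
  for (;;) {
    try {
      fd = fs.openSync(lockPath, "wx"); // atomic exclusive-create
      break;
    } catch (e) {
      if ((e as NodeJS.ErrnoException).code !== "EEXIST") throw e;
      if (Date.now() >= deadline) throw new Error(`timed out waiting for ${lockPath}`);
      sleepSync(LOCK_RETRY_MS); // retry every 25ms up to the 5s timeout
    }
  }
  try {
    fs.writeSync(fd, `${process.pid}\n`); // stamp holder PID for stale-lock detection
    return fn();
  } finally {
    fs.closeSync(fd);
    try {
      fs.unlinkSync(lockPath); // always release, even if fn() threw
    } catch {
      /* already removed */
    }
  }
}
```

The `'wx'` flag is what makes this safe across processes: the open fails with `EEXIST` if the lockfile already exists, so exactly one writer wins the create.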
Claude finished @carlos-alm's task — PR Review: fix(journal): serialize concurrent appends via lockfile
Root Cause Analysis

Problem Being Solved: Race condition where concurrent journal writers (watch session + manual build) interleave appends.

Why This Approach: Uses atomic file locking via fs.openSync('wx').

Risk Assessment: LOW - Well-implemented defensive approach with proper timeout, dead PID detection, and stale lock cleanup.

Backlog Compliance
Technical Review

Implementation Quality ⭐⭐⭐⭐☆

Strong Points:
Implementation Details:
Test Coverage ⭐⭐⭐⭐☆

New Tests Added:
Test Quality:
Potential Issues (Minor)
Critical Concerns

None. This is a solid fix for a real concurrency issue.

Final Recommendation

The only reason this isn't 5 stars is that it's solving an existing issue rather than adding exceptional new value, but the execution is excellent.
Greptile Summary

This PR serializes concurrent journal writers using an advisory lockfile (.codegraph/changes.journal.lock).

Confidence Score: 4/5

Safe to merge for typical usage; two residual P2 races only manifest under resource exhaustion (EMFILE) or I/O errors combined with concurrent writers. All three prior P1s are resolved. Remaining findings are P2: a self-lockout under EMFILE when openSync('r+') fails after a successful steal rename, and an unguarded unlinkSync in the writeSync-failure handler that could delete a concurrent stealer's live lock under ENOSPC. Both require unusual OS-level failure modes on top of concurrent writers, so they don't block merge but are worth a follow-up fix.

src/domain/graph/journal.ts — trySteal (lines 84–91) and acquireJournalLock writeSync-failure handler (lines 106–115)

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant W as Watcher
    participant B as Build
    participant FS as Filesystem
    W->>FS: openSync('wx') → fd [EEXIST if B holds lock]
    W->>FS: writeSync(fd, PID+nonce)
    Note over W: acquireJournalLock ✓
    B->>FS: openSync('wx') → EEXIST
    B->>FS: readFileSync(lock) → W's PID
    B->>FS: isPidAlive(W.PID) → true
    B->>FS: statSync(lock) → mtime fresh
    Note over B: shouldSteal=false → sleepSync(25ms) → retry loop
    W->>FS: appendFileSync(journal, lines)
    Note over W: fn() executes
    W->>FS: readFileSync(lock) → nonce match
    W->>FS: unlinkSync(lock)
    Note over W: releaseJournalLock ✓
    B->>FS: openSync('wx') → fd [lock now free]
    B->>FS: writeSync(fd, PID+nonce)
    Note over B: acquireJournalLock ✓
    B->>FS: appendFileSync(journal, lines)
    B->>FS: unlinkSync(lock)
    Note over B: releaseJournalLock ✓
```
Reviews (5) — last reviewed commit: "fix(journal): sweep orphaned lockfile .t..."
```typescript
let holderAlive = true;
try {
  const pidContent = fs.readFileSync(lockPath, 'utf-8').trim();
  holderAlive = isPidAlive(Number(pidContent));
} catch {
  /* unreadable — fall through to age check */
}

if (!holderAlive) {
  try {
    fs.unlinkSync(lockPath);
  } catch {
    /* another writer stole it first */
  }
  continue;
}
```
TOCTOU race allows two processes to concurrently enter the critical section
When multiple processes are waiting on the same stale (dead-PID) lockfile, they can each independently steal the lock from one another's fresh acquisition:
- Process A dies — lockfile contains dead PID A.
- Processes B and C both call `openSync('wx')`, both get `EEXIST`, and both read dead PID A.
- B unlinks and immediately re-acquires (new inode I1, `fn()` starts).
- C now executes `unlinkSync(lockPath)` — it unlinks B's live lockfile (I1), not the stale one.
- C loops, calls `openSync('wx')`, succeeds (new inode I2), and its `fn()` begins concurrently with B's.
Both B and C are now inside the critical section simultaneously — exactly the write-serialisation invariant this PR is meant to guarantee.
The standard mitigation is to verify ownership after the unlink/re-create cycle: write a nonce (or use the inode number) inside the lockfile, then re-read it after openSync succeeds and bail if it doesn't match yours. Alternatively, use fs.renameSync(tmpfile, lockPath) to perform the steal atomically (create a temp file, rename onto the lockfile — POSIX rename is atomic).
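The rename-based steal suggested here could be sketched as below. This is an illustrative version under the comment's assumptions: `trySteal` and the tmp-file naming are hypothetical, not the PR's exact code.

```typescript
import * as crypto from "node:crypto";
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Atomic steal: write our PID + a random nonce into a temp file, then
// rename it onto the lockfile. POSIX rename atomically replaces the
// stale lock, so two stealers can never both "win" the same inode.
function trySteal(lockPath: string): string | null {
  const nonce = crypto.randomBytes(8).toString("hex");
  const tmpPath = `${lockPath}.${nonce}.tmp`;
  fs.writeFileSync(tmpPath, `${process.pid}\n${nonce}\n`);
  fs.renameSync(tmpPath, lockPath);
  // Ownership check: if another stealer's rename landed after ours,
  // the nonce won't match — bail and retry instead of unlinking their
  // live lock.
  const content = fs.readFileSync(lockPath, "utf-8");
  return content.includes(nonce) ? nonce : null;
}
```

The nonce re-read is the key step: it converts "I renamed" (which both racers believe) into "my content is the one on disk" (which only one racer can observe).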
Fixed in f5c737c. Replaced the unlink + openSync('wx') steal pattern with an atomic write-tmp + fs.renameSync steal. Each writer generates a random nonce, writes it into a temp file, and atomically renames onto the lockfile. After rename we re-read the lockfile and only proceed if our nonce is still there — if another stealer's rename landed after ours, we bail and retry instead of unlinking their live lock. Release also nonce-verifies before unlinking. Added a regression test that stages a lockfile with a different-writer nonce after a real steal cycle and asserts we do not retroactively unlink it.
```typescript
function sleepSync(ms: number): void {
  const buf = new Int32Array(new SharedArrayBuffer(4));
  Atomics.wait(buf, 0, 0, ms);
}
```
Atomics.wait freezes the Node.js event loop during lock contention
Atomics.wait is a synchronous, blocking call — it stops the entire V8 event loop for the full ms duration. In a watcher process, every filesystem notification, timer, and pending I/O callback is silenced for each 25 ms retry. In the worst case (5 000 ms timeout, 200 retries), the watcher becomes completely unresponsive for up to 5 seconds before ever throwing.
A lighter alternative that avoids blocking the event loop is a simple busy-spin with process.hrtime.bigint():
```typescript
function sleepSync(ms: number): void {
  const end = process.hrtime.bigint() + BigInt(ms) * 1_000_000n;
  while (process.hrtime.bigint() < end) { /* spin */ }
}
```

This keeps each retry short and doesn't starve unrelated callbacks (though it does keep the CPU busy, which is acceptable for the brief per-retry duration).
Fixed in f5c737c. Replaced Atomics.wait with a short process.hrtime.bigint busy-spin per your suggestion. The 25ms retry interval keeps CPU burn negligible while letting pending FS events, timers, and I/O callbacks in watcher processes keep firing during contention.
Codegraph Impact Analysis

12 functions changed → 14 callers affected across 5 files
Address two Greptile review issues on the journal lockfile:
- P1 TOCTOU: when two stealers observed the same stale (dead-PID)
holder, one's unlink could cross the other's fresh openSync('wx')
acquisition, admitting both writers into the critical section. Replace
the unlink + openSync('wx') pattern with an atomic write-tmp + rename
steal, then verify via a random nonce. If another stealer's rename
landed after ours, we bail and retry instead of unlinking their live
lockfile. Release now also nonce-verifies before unlinking.
- P2 event-loop blockage: replace Atomics.wait with a short hrtime
busy-spin so pending FS events and timer callbacks in the watcher
keep firing during the 25ms retry window.
Add a regression test for the stale-lock steal race that asserts we
never unlink a lockfile whose nonce does not match our own.
Impact: 7 functions changed, 10 affected
```typescript
try {
  const fd = fs.openSync(lockPath, 'wx');
  try {
    fs.writeSync(fd, `${process.pid}\n${nonce}\n`);
  } catch {
    /* PID stamp is advisory; fd is still exclusive */
  }
  return { fd, nonce };
} catch (e) {
  if ((e as NodeJS.ErrnoException).code !== 'EEXIST') throw e;
}
```
Silent writeSync failure voids the nonce and breaks mutual exclusion
If fs.writeSync throws (e.g. ENOSPC, I/O error), the lockfile is created by openSync('wx') but remains empty — the nonce is never written. Two consequences compound:
1. `releaseJournalLock` reads the empty file, `content.includes(lock.nonce)` is `false`, and the lockfile is orphaned rather than unlinked.
2. Any concurrent waiter reads the empty file, computes `Number('') → 0`, then calls `isPidAlive(0)` which returns `false` (the `pid <= 0` guard). With `holderAlive = false`, it immediately calls `trySteal` — stealing a live, active lock and breaking mutual exclusion.
The comment "PID stamp is advisory; fd is still exclusive" only holds when there are no concurrent waiters, but a journal with a watcher + build process is exactly the concurrent scenario this lock is meant to protect.
Fix: when the write fails, release the fd and unlink synchronously (we still hold the exclusive fd at that point) and retry, rather than proceeding into fn() with an un-stamped lockfile.
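The suggested handling could look roughly like this. `stampOrBackOut` is an illustrative name for this sketch, not the PR's actual helper; the point is that while we still hold the exclusive fd, the empty lockfile is unambiguously ours to remove.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// If stamping the PID + nonce fails, tear down the lockfile we just
// created and signal the caller to loop back and re-acquire, instead of
// entering the critical section with an un-stamped (empty) lockfile.
function stampOrBackOut(lockPath: string, fd: number, nonce: string): boolean {
  try {
    fs.writeSync(fd, `${process.pid}\n${nonce}\n`);
    return true; // stamped — safe to enter the critical section
  } catch {
    try {
      fs.closeSync(fd); // fd may already be invalid under some failures
    } catch {
      /* ignore */
    }
    try {
      fs.unlinkSync(lockPath); // still ours: no waiter can have stolen a stamped lock
    } catch {
      /* best effort */
    }
    return false; // caller retries acquisition
  }
}
```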
Fixed in e17da0a. When the nonce stamp writeSync throws, acquireJournalLock now closes the fd, unlinks the empty lockfile, and retries instead of returning with an un-stamped lockfile. This prevents concurrent waiters from reading an empty nonce, treating isPidAlive(0) as dead, and stealing the live lock.
Integrates main's appendJournalEntriesAndStampHeader and mkdirSync cleanup with this PR's lockfile serialization. The new function is also wrapped in withJournalLock so it serializes with appendJournalEntries and writeJournalHeader against the same journal file.

Also addresses Greptile P1: silent writeSync failure in acquireJournalLock now releases the fd + unlinks the empty lockfile and retries, preventing empty-lockfile mutual-exclusion breakage.

Impact: 10 functions changed, 16 affected
…ntry (#1002)

Addresses Greptile P2: crash-mid-steal in trySteal leaves .codegraph/changes.journal.lock.<nonce>.tmp files behind. Without cleanup they accumulate silently across crash cycles.

Adds sweepStaleTmpFiles, called at the top of withJournalLock, which removes any changes.journal.lock.*.tmp older than LOCK_STALE_MS. The age filter avoids racing an in-flight steal on another process.

Impact: 2 functions changed, 10 affected
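The sweep described in this commit could be sketched as follows; the function name matches the commit message, but the exact signature and `LOCK_STALE_MS` value are assumptions for illustration.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const LOCK_STALE_MS = 30_000; // assumed to match the 30s stale-lock threshold

// Remove orphaned steal temp files (changes.journal.lock.<nonce>.tmp)
// left behind by a crash mid-steal. Only files older than LOCK_STALE_MS
// are removed, so an in-flight steal on another process is never raced.
function sweepStaleTmpFiles(dir: string, lockBase = "changes.journal.lock"): void {
  const now = Date.now();
  for (const name of fs.readdirSync(dir)) {
    if (!name.startsWith(`${lockBase}.`) || !name.endsWith(".tmp")) continue;
    const full = path.join(dir, name);
    try {
      if (now - fs.statSync(full).mtimeMs > LOCK_STALE_MS) fs.unlinkSync(full);
    } catch {
      /* removed by a concurrent sweeper — ignore */
    }
  }
}
```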
Addressed Greptile's review feedback from the 5/5 review:
Summary
Fixes #996.
`appendJournalEntries()` relied on `fs.appendFileSync` alone — that's atomic within a single call but not across concurrent writers. When a watch session and a manual `codegraph build` in another shell both write to `.codegraph/changes.journal`, their lines can interleave, producing truncated `DELETED` prefixes, partial paths, or a `\n`-less tail. A corrupted header then makes `readJournal` return `valid: false` and silently fall through to a hash scan — a silent perf regression, not a crash.

Fix
Wrap both `appendJournalEntries` and `writeJournalHeader` in `withJournalLock`:

- acquires `.codegraph/changes.journal.lock` with `fs.openSync(path, 'wx')` — atomic exclusive-create on ext4/NTFS/APFS
- stamps the holder PID for stale-lock detection
- retries every 25ms up to a 5s timeout
- steals the lock if the holder PID is dead (`process.kill(pid, 0)`) or the file's mtime is older than 30s — covers crash-mid-write
- always releases in `finally`

Zero new dependencies. `Atomics.wait` on a `SharedArrayBuffer` provides synchronous retry sleep so the existing sync API is preserved (watcher and build stages call these functions synchronously).

Test plan
- `npx vitest run tests/unit/journal.test.ts` — 20/20 pass (17 existing + 3 new: lockfile cleanup, dead-PID stale-lock stealing, no-interleave across 200 mixed appends)
- `npx vitest run tests/builder/detect-changes.test.ts tests/integration/build.test.ts tests/integration/watcher-rebuild.test.ts` — 27/27 pass
- `npm run typecheck` — clean